In October 2012, the US government's Center for Medicare and Medicaid Services (CMS) began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio, by dividing a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.
In this exercise, you will:
More instructions provided below. Include your work in this notebook and submit to your Github account.
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import bokeh.plotting as bkp
from mpl_toolkits.axes_grid1 import make_axes_locatable
In [2]:
# read in readmissions data provided
hospital_read_df = pd.read_csv('data/cms_hospital_readmissions.csv')
In [3]:
# deal with missing and inconvenient portions of data
clean_hospital_read_df = hospital_read_df[hospital_read_df['Number of Discharges'] != 'Not Available']
clean_hospital_read_df.loc[:, 'Number of Discharges'] = clean_hospital_read_df['Number of Discharges'].astype(int)
clean_hospital_read_df = clean_hospital_read_df.sort_values('Number of Discharges')
In [4]:
# generate a scatterplot for number of discharges vs. excess rate of readmissions
# lists work better with matplotlib scatterplot function
x = [a for a in clean_hospital_read_df['Number of Discharges'][81:-3]]
y = list(clean_hospital_read_df['Excess Readmission Ratio'][81:-3])
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x, y,alpha=0.2)
ax.fill_between([0,350], 1.15, 2, facecolor='red', alpha = .15, interpolate=True)
ax.fill_between([800,2500], .5, .95, facecolor='green', alpha = .15, interpolate=True)
ax.set_xlim([0, max(x)])
ax.set_xlabel('Number of discharges', fontsize=12)
ax.set_ylabel('Excess rate of readmissions', fontsize=12)
ax.set_title('Scatterplot of number of discharges vs. excess rate of readmissions', fontsize=14)
ax.grid(True)
fig.tight_layout()
Read the following results/report. While you are reading it, think about if the conclusions are correct, incorrect, misleading or unfounded. Think about what you would change or what additional analyses you would perform.
A. Initial observations based on the plot above
B. Statistics
C. Conclusions
D. Regulatory policy recommendations
Include your work on the following in this notebook and submit to your Github account.
A. Do you agree with the above analysis and recommendations? Why or why not?
B. Provide support for your arguments and your own recommendations with a statistically sound analysis:
You can compose in notebook cells using Markdown:
In [5]:
clean_hospital_read_df.head()
Out[5]:
In [6]:
clean_hospital_read_df.info()
In [7]:
hospital_dropna_df = clean_hospital_read_df[np.isfinite(clean_hospital_read_df['Excess Readmission Ratio'])]
hospital_dropna_df.info()
The analysis seems to base its conclusion on one scatterplot. I cannot agree with the above analysis and recommendations because there is not enough evidence to support it. Further investigation and statistical analysis should be completed to determine whether there is a significant correlation between hospital capacity (number of discharges) and readmission rates. It may be possible that the correlation may have come from chance.
$H_0$: There is no statistically significant correlation between number of discharges and readmission rates.
$H_A$: There is a statistically significant negative correlation between number of discharges and readmission rates.
In [8]:
number_of_discharges = hospital_dropna_df['Number of Discharges']
excess_readmission_ratio = hospital_dropna_df['Excess Readmission Ratio']
pearson_r = np.corrcoef(number_of_discharges, excess_readmission_ratio)[0, 1]
print('The Pearson correlation of the sample is', pearson_r)
In [9]:
permutation_replicates = np.empty(100000)
for i in range(len(permutation_replicates)):
number_of_discharges_perm = np.random.permutation(number_of_discharges)
permutation_replicates[i] = np.corrcoef(number_of_discharges_perm, excess_readmission_ratio)[0, 1]
p = np.sum(permutation_replicates <= pearson_r) / len(permutation_replicates)
print('p =', p)
The p value was calculated to be extremely small above. This means our null hypothesis ($H_0$) should be rejected and the alternate hypothesis is more likely. There is a statistically significant negtive correlation between number of discharges and readmission rates. However the correlation is small as shown by the pearson correlation above of -0.097.
Statistical significance refers to how unlikely the Pearson correlation observed in the samples have occurred due to sampling errors. We can use it in hypothesis testing. Statisical significance tells us whether the value we were looking at, (in this case - pearson correlation) occured due to chance. Because the p value we calculated was so small, we can conclude that the correlation was statistically significant and that there likely is a negative correlation between number of discharges and readmission rates.
Practical significance looks at whether the Pearson correlation is large enough to be a value that shows a significant relationship. Practical significance looks at the value itself. In this case, the Pearson correlation was very small (-0.097). There probably is a negative correlation. However it is so small and close to zero that it is not very significant to the relationship between number of discharges and readmission rates.
Overall, size or number of discharges is not a good predictor for readmission rates. The recommendations above should not be followed. Instead further analysis should be performed to figure out what data has a higher correlation with readmission rates.
This plot above is able to display all of the data points at once. Sometimes a quick visual like this will indicate a relationship. However, it is hard to tell with this plot alone due to the data. If the plot included a line which showed the value of the negative correlation, it would more clearly present the relationship of the data.
In [10]:
import seaborn as sns
sns.set()
plt.scatter(number_of_discharges, excess_readmission_ratio, alpha=0.5)
slope, intercept = np.polyfit(number_of_discharges, excess_readmission_ratio, 1)
x = np.array([0, max(number_of_discharges)])
y = slope * x + intercept
plt.plot(x, y)
plt.xlabel('Number of discharges')
plt.ylabel('Excess rate of readmissions')
plt.title('Scatterplot of number of discharges vs. excess rate of readmissions')
plt.show()